The classical Bayesian posterior arises naturally as the unique solution of several different
optimization problems, without the necessity of interpreting data as conditional
probabilities and then using Bayes' Theorem. For example, the classical Bayesian
posterior is the unique posterior that minimizes the loss of Shannon information in
combining the prior and the likelihood distributions. These results, direct corollaries
of recent results about conflations of probability distributions, reinforce the use of
Bayesian posteriors, and may help partially reconcile some of the differences between
classical and Bayesian statistics.

1 Introduction 

In statistics, prior belief about the value of an unknown parameter $\theta \in \Theta \subset \mathbb{R}^n$, obtained
from experiments or other methods, is often expressed as a Borel probability distribution $P_0$
on $\Theta \subset \mathbb{R}^n$ called the prior distribution. New evidence or information about the value of $\theta$,
based on an independent experiment or survey, is recorded as a likelihood distribution $L$. Here
and throughout it will be assumed that the likelihood function has finite positive total mass,
and that $L$ has been normalized, so that in fact $L$ is also a Borel probability distribution on
$\Theta$. Given the prior distribution $P_0$ and the likelihood distribution $L$, a posterior distribution
$P_1 = P_1(P_0, L)$ for $\theta$ incorporates the new likelihood information about $\theta$ into the information
from the prior, thus updating the prior. The posterior distribution $P_1$ is typically viewed as
the conditional distribution of $\theta$ given the new likelihood information, often expressed as a
random variable $X$.

The first main goal of this note is to use recent results for conflations of probability
distributions [3, 4] to show that the Bayesian posterior is the unique posterior that minimizes
the loss of Shannon information in combining the prior and likelihood distributions. The 
Bayesian posterior is also the unique posterior that attains the minimax likelihood ratio 
of the prior and likelihood distributions, and the unique posterior that is a proportional 
consolidation of the prior and likelihood distributions. Thus, the classical Bayesian posterior 
appears naturally as the solution of several different optimization problems, without the 
necessity of interpreting likelihood as a conditional probability and then invoking Bayes'
Theorem. These results reinforce the use of Bayesian posteriors, and may help partially
reconcile some of the differences between classical statistics and Bayesian statistics. 

The second main goal of this note, another direct corollary of recent results for conflations
of probability distributions [1], is to identify the best posterior when the prior and likelihood
distributions are not weighted equally, such as in cases when the prior distribution is
given more weight than the likelihood distribution. This new weighted posterior, the unique
distribution that minimizes the loss of weighted Shannon information, coincides with the
classical Bayesian posterior if the prior and likelihood are weighted equally, but in general is
different.

2 Combining Priors and Likelihoods into Posteriors 

There are many different methods for combining several probability distributions (e.g., see [H 
[3]), and in particular, for combining the prior distribution $P_0$ and the likelihood distribution
$L$ into a single posterior distribution $P_1 = P_1(P_0, L)$. For example, the prior and likelihood
could simply be averaged, i.e. $P_1 = \frac{P_0 + L}{2}$, or the data underlying the prior and the likelihood
could be averaged, in which case the posterior $P_1$ would be the distribution of $\frac{X_0 + X_L}{2}$, where
$X_0$ and $X_L$ are independent random variables with distributions $P_0$ and $L$, respectively.

In Bayesian statistics, the likelihood function $L$ is usually interpreted as $L(\theta) = \alpha P(X \mid \theta)$,
where $X$ is the independent experiment or random variable yielding new information
about $\theta$, and $\alpha$ is the normalizing constant for $L$ to have mass one (cf. [2]). The Bayesian
posterior distribution $P_B$ is then calculated using Bayes' Theorem: for example, if $P_0$ and
$L$ are discrete with probability mass functions (p.m.f.'s) $p_0$ and $p_L$ respectively, then $P_B$ is
discrete with p.m.f.

$$p_B(\theta) = \frac{p_0(\theta)\,p_L(\theta)}{\sum_{\hat\theta \in \Theta} p_0(\hat\theta)\,p_L(\hat\theta)},$$

and if $P_0$ and $L$ are absolutely continuous with probability density functions (p.d.f.'s) $f_0$ and
$f_L$ respectively, then $P_B$ is absolutely continuous with p.d.f.

$$f_B(\theta) = \frac{f_0(\theta)\,f_L(\theta)}{\int_{\Theta} f_0(\hat\theta)\,f_L(\hat\theta)\,d\hat\theta}$$

(provided the denominators are positive and finite).
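
As an illustration of the discrete case, the following minimal sketch normalizes the pointwise
product of a prior p.m.f. and a likelihood p.m.f. The function name `bayes_posterior` and the
numerical values are assumptions chosen only for this example; they do not come from the text.

```python
def bayes_posterior(p0, pL):
    """Discrete Bayesian posterior: pointwise product of the two p.m.f.'s,
    divided by the sum of the products (hypothetical helper)."""
    products = [a * b for a, b in zip(p0, pL)]
    total = sum(products)
    if total <= 0:
        # The denominator must be positive for the posterior to exist.
        raise ValueError("prior and likelihood share no support")
    return [x / total for x in products]


# Illustrative p.m.f.'s on a three-point parameter space (assumed values).
prior = [0.5, 0.3, 0.2]
likelihood = [0.2, 0.2, 0.6]
posterior = bayes_posterior(prior, likelihood)
print(posterior)
```

The guard on the denominator corresponds to the proviso that the denominator be positive.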

3 Minimizing Loss of Shannon Information 

When the goal is to consolidate information from a prior distribution and a likelihood
distribution into a (posterior) distribution, replacing those two distributions by a single
distribution will clearly result in some loss of information, however that is defined. Recall that
the classical Shannon information (also called the self-information or surprisal) associated
with the event $A$ from a probability distribution $P$, $S_P(A)$, is given by $S_P(A) = -\log_2 P(A)$
(so the smaller the value of $P(A)$, the greater the information or surprise). The numerical
value of the Shannon information of a given probability is simply the number of binary bits
of information reflected in that probability.

Example 3.1. If $P$ is uniformly distributed on $(0,1)$ and $A = (0,0.25) \cup (0.5,0.75)$, then
$S_P(A) = -\log_2(P(A)) = -\log_2(0.5) = 1$, so if $X$ is a random variable with distribution $P$,
then exactly one binary bit of information is obtained by observing that $X \in A$, in this case
that the value of the second binary digit of $X$ is 0.
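
The arithmetic of Example 3.1 can be reproduced in a few lines. This is only a sketch; the
helper name `shannon_information` is an assumption introduced for the illustration.

```python
import math


def shannon_information(prob):
    """Shannon information (surprisal) in bits of an event of probability prob."""
    if not 0 < prob <= 1:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log2(prob)


# P is uniform on (0, 1), so P(A) is the total length of A = (0, 0.25) U (0.5, 0.75).
p_A = (0.25 - 0.0) + (0.75 - 0.5)
bits = shannon_information(p_A)
print(bits)
```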



Definition 3.2. The combined Shannon Information associated with the event $A$ from the
prior distribution $P_0$ and the likelihood distribution $L$ is

$$S_{\{P_0,L\}}(A) = S_{P_0}(A) + S_L(A) = -\log_2 P_0(A)L(A),$$

and the maximum loss between the Shannon Information of a posterior distribution $P_1$
and the combined Shannon information of the prior and likelihood distributions $P_0$ and
$L$, $M(P_1; P_0, L)$, is

$$M(P_1; P_0, L) = \max_A \left\{ S_{\{P_0,L\}}(A) - S_{P_1}(A) \right\} = \max_A \left\{ \log_2 \frac{P_1(A)}{P_0(A)L(A)} \right\}.$$

Note that the definition of combined Shannon information implicitly assumes independence
of the prior and likelihood distributions. Note also that no information is obtained by
observing an event that is certain to occur, so for instance $S_{\{P_0,L\}}(\Theta) = S_{P_1}(\Theta) = 0$. This
implies that $M(P_1; P_0, L)$ is never negative, since the maximum includes the event $A = \Theta$.
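
For a small finite parameter space the maximum loss $M(P_1; P_0, L)$ can be evaluated by brute
force over all nonempty events. The sketch below does this under stated assumptions: the
function name `max_information_loss`, the three-point space, and the numerical p.m.f.'s are
illustrative only, and the enumeration is exponential in the number of points.

```python
import math
from itertools import combinations


def max_information_loss(p1, p0, pL):
    """Maximum over nonempty events A of log2( P1(A) / (P0(A) L(A)) )."""
    n = len(p0)
    worst = -math.inf
    for size in range(1, n + 1):
        for event in combinations(range(n), size):
            P1 = sum(p1[i] for i in event)
            P0 = sum(p0[i] for i in event)
            L = sum(pL[i] for i in event)
            if P1 == 0:
                continue  # log2(0) is -infinity, so this event cannot be the maximum
            if P0 * L == 0:
                return math.inf  # posterior mass where the product vanishes
            worst = max(worst, math.log2(P1 / (P0 * L)))
    return worst


# Assumed example: prior, likelihood, and the simple average of the two as posterior.
p0 = [0.5, 0.3, 0.2]
pL = [0.2, 0.2, 0.6]
mixture = [(a + b) / 2 for a, b in zip(p0, pL)]
print(max_information_loss(mixture, p0, pL))
```

Because the whole space is among the events, the returned value is nonnegative up to
rounding, in line with the note above.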

Definition 3.3. A prior distribution $P_0$ and a likelihood distribution $L$ are compatible if $P_0$
and $L$ are both discrete with p.m.f.'s $p_0$ and $p_L$ satisfying $\sum_{\theta \in \Theta} p_0(\theta)p_L(\theta) > 0$, or are both
absolutely continuous with p.d.f.'s $f_0$ and $f_L$ satisfying $0 < \int_{\Theta} f_0(\theta)f_L(\theta)\,d\theta < \infty$.

Example 3.4. Every two geometric distributions are compatible, every two normal distri- 
butions are compatible, and every exponential distribution is compatible with every normal 
distribution. Distributions with disjoint support, discrete or continuous, are not compatible. 
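
In the discrete case the compatibility condition is a one-line check. The following sketch
represents each p.m.f. as a dictionary from outcomes to probabilities; this representation and
the name `are_compatible` are assumptions made for the illustration.

```python
def are_compatible(p0, pL):
    """True when the sum over common outcomes of p0 * pL is strictly positive."""
    common = set(p0) & set(pL)
    return sum(p0[t] * pL[t] for t in common) > 0


# Overlapping supports: compatible.
overlapping = are_compatible({0: 0.5, 1: 0.5}, {1: 0.25, 2: 0.75})
# Disjoint supports: not compatible.
disjoint = are_compatible({0: 1.0}, {1: 1.0})
print(overlapping, disjoint)
```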

Remark. In practice, compatibility is not problematic. Any two distributions may be easily 
transformed into two new distributions, arbitrarily close to the original distributions, so that 
the two new distributions are compatible, for instance by convolving each with a $U(-\epsilon, \epsilon)$
distribution. 

Theorem 3.5. Let $P_0$ and $L$ be discrete compatible prior and likelihood distributions. Then
the Bayesian posterior $P_B$ is the unique posterior distribution that minimizes the maximum
loss of Shannon information from the prior and likelihood distributions, i.e., that minimizes
$M(P_1; P_0, L)$ among all posterior distributions $P_1$. Moreover,

$$M(P_1; P_0, L) \ge \log_2 \left( \sum_{\theta \in \Theta} p_0(\theta)p_L(\theta) \right)^{-1} \quad \text{for all posterior distributions } P_1,$$

and equality is uniquely attained by the Bayesian posterior $P_1 = P_B$.

The conclusion of Theorem 3.5 follows immediately as a special case of [3, Corollary 4.4];
analogous conclusions for the case of compatible absolutely continuous distributions follow
from [3, Theorem 4.5]. For the benefit of the reader, a sketch of the proof of Theorem 3.5
similar to that in [4] is included.

Sketch of proof. First observe that for an event $A$, the difference between the combined
Shannon information obtained from a prior distribution $P_0$ and a likelihood distribution $L$,
and the Shannon information obtained from the posterior $P_1$, is

$$S_{\{P_0,L\}}(A) - S_{P_1}(A) = S_{P_0}(A) + S_L(A) - S_{P_1}(A) = \log_2 \frac{P_1(A)}{P_0(A)L(A)}.$$

Since $\log_2(x)$ is strictly increasing, the maximum (loss) thus occurs for an event $A$ where
$\frac{P_1(A)}{P_0(A)L(A)}$ is maximized.

Next note that the largest loss of Shannon information occurs for small sets $A$, since for
disjoint sets $A$ and $B$,

$$\frac{P_1(A \cup B)}{P_0(A \cup B)L(A \cup B)} \le \frac{P_1(A) + P_1(B)}{P_0(A)L(A) + P_0(B)L(B)} \le \max\left\{ \frac{P_1(A)}{P_0(A)L(A)}, \frac{P_1(B)}{P_0(B)L(B)} \right\},$$

where the inequalities follow from the inequalities $(a + b)(c + d) \ge ac + bd$ and
$\frac{a+b}{c+d} \le \max\left\{\frac{a}{c}, \frac{b}{d}\right\}$ for positive numbers $a, b, c, d$. Thus the problem reduces to finding the
probability mass function $p$ that makes the maximum, over all real values $\theta$, of the ratio
$\frac{p(\theta)}{p_0(\theta)p_L(\theta)}$ as small as possible. But for fixed positive $r_1, \ldots, r_n$, the minimum over all nonnegative
$q_1, \ldots, q_n$ with $q_1 + \cdots + q_n = 1$ of the maximum of $\frac{q_1}{r_1}, \ldots, \frac{q_n}{r_n}$ occurs when
$\frac{q_1}{r_1} = \cdots = \frac{q_n}{r_n}$ (if they are not equal, reducing the
numerator of the largest ratio, and increasing that of the smallest, will make the maximum
smaller). Thus the $p$ that makes the maximum of $\frac{p(\theta)}{p_0(\theta)p_L(\theta)}$ as small as possible is
$p(\theta) = c\,p_0(\theta)p_L(\theta)$, where $c$ is chosen to make $p$ a probability mass function, i.e., to make
$p(\theta)$ sum to 1. But this is exactly the definition of the Bayesian posterior $P_B$ in the discrete
case. □
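
The reduction to single points suggests a simple numerical sanity check. The sketch below
draws random candidate p.m.f.'s and counts how often the largest single-point ratio falls
below $\log_2 (\sum_\theta p_0(\theta)p_L(\theta))^{-1}$; the names `singleton_loss` and `violations`, the
seed, and the numerical p.m.f.'s are assumptions for the illustration. It is a spot check under
those assumptions, not a proof.

```python
import math
import random


def singleton_loss(p, p0, pL):
    """Largest value of log2( p / (p0 * pL) ) over single points with p > 0.
    Assumes p0 * pL > 0 at every point."""
    return max(math.log2(q / (a * b)) for q, a, b in zip(p, p0, pL) if q > 0)


p0 = [0.5, 0.3, 0.2]
pL = [0.2, 0.2, 0.6]
products = [a * b for a, b in zip(p0, pL)]
total = sum(products)
bayes = [x / total for x in products]
bound = math.log2(1 / total)

rng = random.Random(0)  # fixed seed so the run is repeatable
violations = 0
for _ in range(1000):
    raw = [rng.random() for _ in p0]
    candidate = [x / sum(raw) for x in raw]
    if singleton_loss(candidate, p0, pL) < bound - 1e-12:
        violations += 1

print(singleton_loss(bayes, p0, pL), bound, violations)
```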



4 Minimax Likelihood Ratios 

In classical hypothesis testing, a standard technique to decide from which of several known
distributions given data actually came is to maximize the likelihood ratios, that is, the ratios
of the p.m.f.'s or p.d.f.'s. Analogously, when the objective is to decide how best to consolidate
a prior distribution $P_0$ and a likelihood distribution $L$ into a single (posterior) distribution
$P_1 = P_1(P_0, L)$, one natural criterion is to choose $P_1$ so as to make the ratios of the likelihood
of observing $\theta$ under $P_1$ as close as possible to the likelihood of observing $\theta$ under both the
prior distribution $P_0$ and the likelihood distribution $L$. This motivates the following notion
of minimax likelihood ratio posterior.

Definition 4.1. A discrete probability distribution $P^*$ (with p.m.f. $p^*$) is the minimax
likelihood ratio (MLR) posterior of a discrete prior distribution $P_0$ with p.m.f. $p_0$ and a
discrete likelihood distribution $L$ with p.m.f. $p_L$ if

$$\min_{\text{p.m.f.'s } p} \left\{ \max_{\theta \in \Theta} \frac{p(\theta)}{p_0(\theta)p_L(\theta)} - \min_{\theta \in \Theta} \frac{p(\theta)}{p_0(\theta)p_L(\theta)} \right\}$$

is attained by $p = p^*$ (where $0/0 := 1$).

Similarly, an a.c. distribution $P^*$ with p.d.f. $f^*$ is the MLR posterior of an a.c. prior
distribution $P_0$ with p.d.f. $f_0$ and an a.c. likelihood distribution $L$ with p.d.f. $f_L$ if

$$\min_{\text{p.d.f.'s } f} \left\{ \operatorname*{ess\,sup}_{\theta \in \Theta} \frac{f(\theta)}{f_0(\theta)f_L(\theta)} - \operatorname*{ess\,inf}_{\theta \in \Theta} \frac{f(\theta)}{f_0(\theta)f_L(\theta)} \right\}$$

is attained by $f^*$.



The min-max terms in Definition 4.1 are similar to the min-max criterion for loss of
Shannon Information (Theorem 3.5), whereas the others are dual max-min criteria. Just as
the Bayesian posterior minimizes the loss of Shannon information, the Bayesian posterior is
also the MLR posterior of the prior and likelihood distributions.

Theorem 4.2. Let $P_0$ and $L$ be compatible discrete or compatible absolutely continuous prior
and likelihood distributions, respectively. Then the unique MLR posterior for $P_0$ and $L$ is
the Bayesian posterior distribution $P_B$.

Proof. Immediate from [3, Theorem 5.2]. □
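
The discrete MLR objective, the largest ratio $p(\theta)/(p_0(\theta)p_L(\theta))$ minus the smallest,
is easy to evaluate for a given candidate $p$. The sketch below uses the convention
$0/0 := 1$; the names `ratio` and `mlr_spread` and the numerical values are assumptions
made for the illustration.

```python
def ratio(num, den):
    """num / den with the convention 0/0 := 1; a positive numerator over 0 is infinite."""
    if den == 0:
        return 1.0 if num == 0 else float("inf")
    return num / den


def mlr_spread(p, p0, pL):
    """Largest minus smallest value of p / (p0 * pL) over the parameter space."""
    ratios = [ratio(q, a * b) for q, a, b in zip(p, p0, pL)]
    return max(ratios) - min(ratios)


p0 = [0.5, 0.3, 0.2]
pL = [0.2, 0.2, 0.6]
total = sum(a * b for a, b in zip(p0, pL))
bayes = [a * b / total for a, b in zip(p0, pL)]
mixture = [(a + b) / 2 for a, b in zip(p0, pL)]
print(mlr_spread(bayes, p0, pL), mlr_spread(mixture, p0, pL))
```

In this example every point has positive product $p_0 p_L$, so the ratios for the normalized
product are all equal and its spread is zero up to rounding, while the averaged posterior has
a strictly positive spread.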

5 Proportional Posteriors 

A criterion similar to likelihood ratios is to require that the posterior distribution $P_1$ reflect
the relative likelihoods of identical individual outcomes under both $P_0$ and $L$. For example,
if the probability that the prior and the (independent) likelihood are both $\theta_a$ is twice that
of the probability both are $\theta_b$, then $P_1(\theta_a)$ should also be twice as large as $P_1(\theta_b)$.

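Definition 5.1. A discrete (posterior) probability distribution $P^*$ with p.m.f. $p^*$ is a
proportional posterior of a discrete prior distribution $P_0$ with p.m.f. $p_0$ and a compatible discrete
likelihood distribution $L$ with p.m.f. $p_L$ if

$$\frac{p^*(\theta_a)}{p^*(\theta_b)} = \frac{p_0(\theta_a)p_L(\theta_a)}{p_0(\theta_b)p_L(\theta_b)} \quad \text{for all } \theta_a, \theta_b \in \Theta.$$

Similarly, a posterior a.c. distribution $P^*$ with p.d.f. $f^*$ is a proportional posterior of an
a.c. prior distribution $P_0$ with p.d.f. $f_0$ and a compatible likelihood distribution $L$ with p.d.f.
$f_L$ if

$$\frac{f^*(\theta_a)}{f^*(\theta_b)} = \frac{f_0(\theta_a)f_L(\theta_a)}{f_0(\theta_b)f_L(\theta_b)}$$

for (Lebesgue) almost all $\theta_a, \theta_b \in \Theta$.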



Theorem 5.2. Let $P_0$ and $L$ be compatible discrete or compatible absolutely continuous prior
and likelihood distributions, respectively. Then the Bayesian posterior distribution $P_B$ is a
proportional consolidation for $P_0$ and $L$.

Proof. Immediate from [3, Theorem 5.5]. □
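
The proportionality requirement, that $P_1(\theta_a)/P_1(\theta_b)$ equal the ratio of the products
$p_0 p_L$ at $\theta_a$ and $\theta_b$, can be tested pairwise. The sketch below cross-multiplies to
avoid division by zero; the name `is_proportional`, the tolerances, and the numerical values
are assumptions made for the illustration.

```python
import math
from itertools import combinations


def is_proportional(p, p0, pL):
    """Check p[i] * (p0*pL)[j] == p[j] * (p0*pL)[i] for every pair of points."""
    prods = [a * b for a, b in zip(p0, pL)]
    return all(
        math.isclose(p[i] * prods[j], p[j] * prods[i], rel_tol=1e-9, abs_tol=1e-12)
        for i, j in combinations(range(len(p)), 2)
    )


p0 = [0.5, 0.3, 0.2]
pL = [0.2, 0.2, 0.6]
total = sum(a * b for a, b in zip(p0, pL))
bayes = [a * b / total for a, b in zip(p0, pL)]
mixture = [(a + b) / 2 for a, b in zip(p0, pL)]
print(is_proportional(bayes, p0, pL), is_proportional(mixture, p0, pL))
```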

6 Optimal Posteriors for Weighted Prior and Likelihood Distributions

Definition 6.1. Given a prior distribution $P_0$ with weight $w_0 > 0$ and a likelihood
distribution $L$ with weight $w_L > 0$, the combined weighted Shannon information associated with
the event $A$, $S_{(P_0,w_0;L,w_L)}(A)$, is

$$S_{(P_0,w_0;L,w_L)}(A) = \frac{w_0}{\max\{w_0, w_L\}} S_{P_0}(A) + \frac{w_L}{\max\{w_0, w_L\}} S_L(A).$$



This definition ensures that only the relative weights are important, so for instance if
$w_0 = w_L$, the combined weighted Shannon information of the prior and likelihood always
coincides with the (unweighted) combined Shannon information of the prior and likelihood.
Note again that no information is obtained by observing any event that is certain to occur, no
matter what the distributions and weights, since $S_{P_0}(\Theta) = S_L(\Theta) = 0$. The next theorem, a
special case of [H (8)], identifies the posterior distribution that minimizes the loss of weighted 
Shannon information in the case the prior and likelihood distributions are compatible discrete 
distributions; the case for compatible absolutely continuous distributions is analogous. 

Theorem 6.2. Let $P_0$ and $L$ be compatible discrete prior and likelihood distributions with
p.m.f.'s $p_0$ and $p_L$ and weights $w_0 > 0$ and $w_L > 0$, respectively. Then the unique posterior
distribution that minimizes the maximum loss of Shannon information from the weighted
prior and likelihood distributions, i.e., that minimizes, among all posterior distributions $P_1$,

$$\max_A \left\{ S_{(P_0,w_0;L,w_L)}(A) - S_{P_1}(A) \right\},$$

is the posterior distribution $P_w$ with p.m.f.

$$p_w(\theta) = \frac{(p_0(\theta))^{w_0/\max\{w_0,w_L\}} \, (p_L(\theta))^{w_L/\max\{w_0,w_L\}}}{\sum_{\hat\theta \in \Theta} (p_0(\hat\theta))^{w_0/\max\{w_0,w_L\}} \, (p_L(\hat\theta))^{w_L/\max\{w_0,w_L\}}}.$$
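
A minimal sketch of the weighted posterior in the discrete case follows. It assumes that each
p.m.f. is raised to its weight divided by the larger of the two weights, and that the product is
then normalized to sum to 1; the function name `weighted_posterior` and the numerical values
are likewise assumptions made for the illustration.

```python
def weighted_posterior(p0, pL, w0, wL):
    """Normalized product p0**(w0/w_max) * pL**(wL/w_max), with w_max = max(w0, wL)."""
    if w0 <= 0 or wL <= 0:
        raise ValueError("weights must be positive")
    w_max = max(w0, wL)
    a, b = w0 / w_max, wL / w_max  # only the relative weights matter
    unnormalized = [x ** a * y ** b for x, y in zip(p0, pL)]
    total = sum(unnormalized)
    if total <= 0:
        raise ValueError("prior and likelihood share no support")
    return [u / total for u in unnormalized]


p0 = [0.5, 0.3, 0.2]
pL = [0.2, 0.2, 0.6]
equal = weighted_posterior(p0, pL, 1.0, 1.0)       # equal weights
prior_heavy = weighted_posterior(p0, pL, 2.0, 1.0)  # prior counts twice as much
print(equal, prior_heavy)
```

With equal weights both exponents are 1, so the output coincides with the normalized
product of the two p.m.f.'s; rescaling both weights by a common factor leaves the result
unchanged.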



Remark. If both the prior and likelihood distributions are normally distributed, the Bayesian 
posterior is also a best linear unbiased estimator (BLUE) and a maximum likelihood esti- 
mator (MLE); e.g. see [3]. 
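
For normal distributions the normalized product of the two densities is again normal, with
precision (inverse variance) equal to the sum of the two precisions and mean equal to the
precision-weighted average of the two means. The sketch below computes these parameters;
the function name `normal_posterior` and the numerical values are assumptions made for the
illustration.

```python
import math


def normal_posterior(m0, s0, mL, sL):
    """Mean and standard deviation of the normalized product of the
    N(m0, s0^2) and N(mL, sL^2) densities."""
    prec0, precL = 1.0 / s0 ** 2, 1.0 / sL ** 2
    variance = 1.0 / (prec0 + precL)
    mean = variance * (prec0 * m0 + precL * mL)  # inverse-variance weighted mean
    return mean, math.sqrt(variance)


mean, sd = normal_posterior(0.0, 1.0, 2.0, 1.0)
print(mean, sd)
```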